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Abstract 

When conducting Bayesian inference, delayed acceptance (DA) Metropolis-Hastings 
(MH) algorithms and DA pseudo-marginal MH algorithms can be applied when it 
is computationally expensive to calculate the true posterior or an unbiased estimate 
thereof, but a computationally cheap approximation is available. A first accept-reject 
stage is applied, with the cheap approximation substituted for the true posterior in 
the MH acceptance ratio. Only for those proposals which pass through the first stage 
is the computationally expensive true posterior (or unbiased estimate thereof) evalu¬ 
ated, with a second accept-reject stage ensuring that detailed balance is satisfied with 
respect to the intended true posterior. In some scenarios there is no obvious com¬ 
putationally cheap approximation. A weighted average of previous evaluations of the 
computationally expensive posterior provides a generic approximation to the poste¬ 
rior. If only the A:-nearest neighbours have non-zero weights then evaluation of the 
approximate posterior can be made computationally cheap provided that the points at 
which the posterior has been evaluated are stored in a multi-dimensional binary tree, 
known as a KD-tree. The contents of the KD-tree are potentially updated after ev¬ 
ery computationally intensive evaluation. The resulting adaptive, delayed-acceptance 
[pseudo-marginal] Metropolis-Hastings algorithm is justified both theoretically and em¬ 
pirically. Guidance on tuning parameters is provided and the methodology is applied 
to a discretely observed Markov jump process characterising predator-prey interactions 
and an ODE system describing the dynamics of an autoregulatory gene network. 

Keywords: Delayed-acceptance; surrogate; adaptive MCMC; pseudo-marginal MCMC; KD- 
tree. 


1 Introduction 


A major challenge for Bayesian inference in com plex statistical models is th at evaluation 
of the likelihood or, for pseudo-marginal MCMC (lAndrieu and Robertsl. 120091) . obtaining a 
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realisation from an unbiased estimator of the likelihood, can be computationally expensive. 
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In this paper we propose to use a relatively generic surrogate for models with expen¬ 
sive likelihoods and we justify its use in adaptive, delayed-ac ceptance (pseudo-margin al) 
MCMC schemes. The delayed acceptance MCMC algorithm of IChristen and FoxI (120051) is 
a two-stage Metropolis-Bastings algorithm in which, typically, proposed parameter values 
are accepted or rejected at the first stage based on a computationally cheap surrogate for 
the likelihood. Detailed balance with respect to the true posterior is ensured by a second 
accept-reject step, based on the computationally expensive likelihood, for those parameter 
values which are accepted in the hrst stage. Delayed acceptance algorithms thus provide 
draws from the posterior distribution of interest whilst potentially limiting the number of 
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For some models there may be an obvious cheap surrogate. For example. 


Golightlv et al. 


(1201511 use both the diffusion approximation and the linear noise approximation as surrogates 
for a Markov jump process in the context of analysing stochastic kinetic models. For many 
models, however, there are no obvious model-based candidates for the surrogate. It is natural 
in such scenarios to use regression-based methods which utilise previous evaluations of the 


computationally expensive likelihood to approximate th e likelihood or t 
posterior density at ne w parameter value s. F or example. 
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(1201111 use Gaussian processes 
(GPs). In this paper, we f ocus on a ge n eric s urrogate based upon likelihood values at the 
fc-nearest neighbours (e.g. iHastie et al.l . 120091) . It is easy to implement, computationally 
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cheap, and it adapts as new evaluations of the computationally-expensive likelihood become 
available. In Section 15.11 we consider the merits and disadvantages of an alternative, GP- 
based solution. 

We approximate the likelihood at proposed parameter values by an inverse-distance- 
weighted average of the likelihoods of its fc-nearest neighbours in the training data; the 
training data here consist of pairs of parameter vectors and their corresponding likelihoods. 
This fc-NN approach has the advantage of being simple, local, flexible to the shape of the 


likelihood, and trivially adaptive as new training da 
adapting a local approximation is thus similar to that in 
and contrast the two approaches in Section 15.11 
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control of the adaptation is therefore essential and we prove that, subject to conditions, the 


strategy we pro pose is theoretically va 


algorithms (e.g. 


Roberts and Rosenthai 


i d. As with a number of pr evious adaptive MCMC 


2009 


Sherlock et ah 


2010 ). our MCMC kernel is a 


mixture of a hxed kernel and an adaptive kernel. Unlike previous such algorithms, however, 
the adaptive kernel can be most efficient when the fraction of applications that involve a 
computationally expensive evaluation is very low; this impacts on the rate of adaptation 
and on the rate of convergence. We therefore also provide theoretically-justihed guidance on 
choosing both the mixture probability and the total number of iterations of the algorithm. 

The main computational expense of the /c-NN approximation is searching for the near¬ 
est neighbours. In a naive implementation this search takes 0{n) operations, where n is 
the number of training datapoints, so the computational expense of an adaptive, nearest- 
neighbour-based approach would grow linearly with the length of the MCMC run. Fortu¬ 


nately, there are more efficient algorithms. We use an approa ch basec 


data in a multi-dimensional binary tree, known as a KD-tree flBentlevi . 


on storing the training 


1975 


Friedman et al. 


19771) . The ‘K’ in ‘KD-tree’ indicates the dimension of the space, but to avoid confusion with 


the number of nearest neighbours k, we denote the dimension of the space by d. Because 
our tree grows on-line as new training data become available we make two major changes 
to the standard KD-tree algorithm, described at the start of the next section. Our adapted 


3 












































KD-tree algorithm allows us to efficiently search for the nearest neighbours in approximately 
0{d\ogn) operations and add an additional value to the training set also in 0{d\ogn) op¬ 
erations. This yields a highly efficient, adaptive surrogate. 

The remainder of the paper is structured as follows. Our KD-tree fc-NN algorithm is 
described in Section [2] and its use in adaptive, delayed-acceptance (pseudo-marginal) MCMC 
is discussed in Section El This section also provides the theoretically-justified guidance on 
the choice of kernel mixture probability and number of iterations. Section 0] applies the 
methodology to a discretely observed Markov jump process characterising predator-prey 
interactions and an ODE system describing the dynamics of an autoregulatory gene network. 
The paper concludes in Section [5] with a discussion. 


2 The KD-tree /c-nearest neighbour algorithm 


Suppose the true parameter is 6^ G We wish to find a cheap approximation tTc^O) to 
IT{6) using a weighted average of the expensive values that have already been calculated. 
These expensive values might be of the true posterior vr(6 *i),..., 7r(9m) (involving, for exam¬ 
ple, the numerical solution of a number of differential equations as in Section 14.2p or the 
expensive values might be 7rs(6*i),..., TisiOm), unbiased stochastic estimates obtained as part 
of a pseudo-marginal MCMC algorithm fSection 14.ip . 

We will average the fc-nearest neighbours. Naive use of a vector of n 6 values and 
associated log-likelihoods, v, is expensive. While adding to the list takes 0(1) operations, 
searching the list for the k nearest neighbours to a particular point, 6*, takes 0{n) operations. 

For our applications, typically the dimension, d, of the problem is moderate: between 3 
and 1 2. For very low dimensional problems an analogue of the quadtree flFinkel and Bentlevi . 
I 974 J ) would be the most efficient approach, but the number of pointers from each node grows 
exponentially with dimension. 


We therefore use a variation on the KD-tree flBentlevl . 
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Friedman et ah 


19771) • Cre¬ 


ation of the tree from n values takes 0{dn\ogn) operations and requires storage of 0{n). 
For a balanced (see Section 12.71) tree the computation required to add an additional value is 
O(dlogn), and to search for a nearest neighbour is also O(dlogn). 

The standard KD-tree has a single item of data {9 value and associated information) at 
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each node; all items further down the tree from this node have been split at this node as 
described in detail in Section 12.41 The standard structure assumes that all of the 9 values to 
be used are available before the tree is constructed, whereas our tree potentially grows with 
each new position at which the log-likelihood (or an unbiased estimate thereof) is evaluated. 
The most efficient ‘splitting point’ of any node of a tree is the median of all relevant values. 
Use of the median leads to a balanced tree where all end nodes would be at approximately 
the same depth. However we do not know the true median. We therefore separate our tree 
in to leaf nodes and branch nodes. Branch nodes provide the splitting information and leaf 
nodes, which occur at the base of the tree, store multiple data values. When a leaf node 
becomes full, it splits according to the median of the relevant data values it contains, rather 
than a true median, and becomes a branch node, creating two leaf nodes beneath it. The 
maximum size of a leaf node is defined so that the leaf median is unlikely to be ‘too far’ 
from the true median; the choice of this tuning parameter is investigated in Section 12.71 and 
in the simulation study in Section 14.11 

2.1 Preliminary run 

To estimate the likelihood at some new point, we will take a weighted average of likelihoods, 
or estimated likelihoods at the k nearest points. If the likelihood varies more quickly with 
distance along some axes than along others then ‘nearest’ should be according to some 
alternative metric such as a Mahalanobis distance. However we also divide the KD-tree 
along hyperplanes which are perpendicular to one of the Cartesian axes. This leads to gross 
inefficiencies when there is a strong correlation between components of 9. For example if 
9 has a bivariate Gaussian distribution with marginal variances of 1 and correlation 0.999 
then the hyperplane splits in a tree of depth 5, say, effectively partition the first principal 
component but fail to partition the second, so that two 9 values which are reasonably close 
according to the Mahalanobis distance can be in different portions of the tree. This problem 
is exacerbated in higher dimensions. 

The algorithm should, therefore, be more efficient if the the parameters are relatively 
uncorrelated and if the length scales in the direction of each axis are similar. A preliminary 
run of the MCMC algorithm allows us to find an approximate center fi, and variance matrix 
S. For all 9 we then define := {9 — fi) to normalise. Since this transformation gives 
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a one-to-one mapping from 0 to "0, in what follows, for notational simplicity, we refer to the 
parameter values to be stored in the KD-tree as 6 and implicitly assume that, in practice, the 
transformation has already been applied. The preliminary run should be of sufficient length 
that the sample approximately represents the gross relationships in the main posterior and 
hence that Euclidean distance is a reasonable metric for the transformed parameters; the 
preliminary chain does not need to have mixed thoroughly. 

2.2 Tree preliminaries 

The KD-tree stores a large number of vector parameter values, 0 G 0, each with an associated 
vector of interest, vq G V, in such a way that the time taken to either update the tree or 
to retrieve the information we require is logarithmic in the number of {6, ve) pairs that are 
stored in the tree. In our case vq = {U^no) G M x N is the logarithm of the average of all 
estimates of the posterior at parameter values close to 9 and the number of such estimates. 

Associated with the vector of interest is a merge function M:0xVx0xV—)■¥, 
which combines the current vector of interest with the vector at a new value, 9*. 

2.3 Tree structure 

The tree consists of branch nodes and leaf nodes. Each branch node has two children, 
each of which may either be a branch node or a leaf node. 

Leaf nodes. Each leaf node stores up to 26 — 1 {9,ve) pairs, and the dimension, dsput G 
{!,..., d}, on which the node will split. 

Branch nodes. Each branch node stores a component, dspiu G {1,..., d} on which it 
was split, and a corresponding scalar value dgpm, the split point. The left-hand child of the 
node contains (if it is a leaf node) or points to (if it is a branch) all {9, v) pairs that have 
passed through this branch node and that have < ^sput- The right-hand child contains 
or points to all (9,v) pairs with 9^^^^.^ > 9spiit- If ~ split h niay be contained by the 
right hand node or the left hand node. 

Root node. The node at the very top of the tree is called the root node. By default the 
root node has dgput = 1. If there are fewer than 2b leaves in the tree then the root node is a 
leaf node, otherwise it is a branch node. 
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2.4 Adding data to a tree 

When a leaf has 2b entries it immediately spawns two leaf node children by splitting all 2b 
entries along the component, dspiu- The splitting point, Osput, is the median of the 2b values 
for Odspiit- The parent node then becomes a branch, and the leaf nodes inherit dspiu = 
where is the component on which the parent has just been split, and © represents regular 
addition except that d © 1 = 1. 

Some initial number, uq of data points are used to create a balanced tree using the 
standard recursive procedure given in the supplementary material (Appendix A.l). After 
this, a single new entry [6, Ig) is added to the tree by descending from the root node and, at 
each branch, comparing component with Ogput to choose the relevant child (if = 
6split then the left child is chosen with probability 0.5). The new entry is added to the 
leaf node that is reached. If this leaf node still has fewer than 2b entries then the algorithm 
stops, otherwise the leaf-node becomes a branch and spawns two child leaf-nodes as described 
above. 

2.5 Searching the tree 

We apply a standard two-stage recursive algorithm to find the r nearest neighbours of a given 
point, 9*, together with their distances from 9*. We provide an overview of the algorithm 
below; the procedure is detailed in full in the supplementary material (Appendix A.2). 

The algorithm hrst descends the tree to find the leaf node to which 9* would belong, as 
described in Section 12.41 It then gradually ascends the tree from this node to the root node; 
after each ascent from a node to its parent a test is conducted to see whether the other child, 
that is the child of the current node from which the algorithm has not just ascended, or any 
of its offspring might hold a closer neighbour than the current k nearest. If this is the case 
then, before any further ascent can take place, the search descends down the tree via this 
other child. 

2.6 Restricting the growth of the tree 

The tree will be used to obtain computationally cheap estimates of the posterior density. 
We would like to ensure that the accuracy of the approximations increases with the amount 
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of information stored in the tree, whether the information arises from exact evaluations of 
the posterior or from stochastic estimates. We would also prefer the cost of obtaining this 
information to increase slowly, if at all, with the amount of information. 

To reduce the total number of leaves we set a minimum distance between leaves, e. For 
any new information, {6*,vg*), to be added to the tree we first ascertain whether or not any 
pairs {0,vg) exist with 116^* — 0|| < e. If one or more such pair exists then the merge function, 
M, described in Section Y272\ is used to combine {9*,vg*) with the nearest pair, providing a 
replacement value for vg; otherwise {9*,vg*) is added to the tree as described in Section ITTI 
When each vg* contains an exact evaluation of the posterior then the merge simply ignores 
the new information. However, when vg* contains a stochastic estimate of the posterior some 
weighted average of the new estimate and of the current average will be more appropriate 
since, by the continuity of the posterior, for sufficiently small e, 7r{9*) ~ n{9) for all 9* such 
that ||0* — 0|| < e. In particular therefore, for pseudo-marginal algorithms, we define 

Mpm{ 9, [lg,ng],9*, [Ig., 1]) := [log [uge'^^ + e'®*] - log(?7,0 1), ng + l]. 

This vector replaces the previous [lg,ng] vector and so is associated with the position 9. 

Choice of merge distance 

Consider a tree with n existing points, 9i,.. .9n and to which it is proposed that a new point 
9*, chosen at random, will be added. We relate the merge distance, e, to the probability, 
Pkeep, that none of the existing points is within the e ball of 9*, so that 9* will be added to 
the tree. This then provides a guide to setting e itself. 

Let B* be the e ball around 9*, dehne to be the number of the n existing points that 
are inside and consider = IE[iV„,e]- The following is proved in the supplementary 
material (Appendix B). 

Proposition 1. //O < < 1 then 1 — E^^e < Pkeep < 

To use Proposition [T] we require an expression for En^e, and this depends on the distri¬ 
bution of ( 6 ^ 1 ,..., 9m 9*). For tractability, and because it will often hold approximately with 
reasonably sized data sets, we suppose that the target is Gaussian, so that 9 

i N{p,E), 

marginally. This is then normalised (see Section 12.ip so that the following result (see again 
Appendix B for a proof) can be applied. 







Proposition 2. Let jointly distributed 9i ~ 7V(0, J^), ..., ~ N{0,ld), be independent of 

9* ~ N{0,ld). Then En,e = I; where F^2 is the cumulative distribution function of 

a Xd random variable. 

In the simulation studies of Section 0] we choose e such that = 0.5, giving 0.5 < 
Pkeep < 0.61 and e ~ \J‘^^Xd 'where is the quantile function of a Xd random 

variable. 


2.7 Ensuring that the tree remains balanced 


We consider two mechanisms through which a KD-tree that is constructed on-line using an 
MCMC algorithm may become unbalanced, and for each problem we provide a solution. 

When d = 1, there are two binary t ree structure s that allow for online rebalancing: 
the red-black tree and the AVL tree (e.g. IStorerl . 120021 ). Unfortunately there are no known 
algorithms for rebalancing a KD-tree online and so we consider an alternative which ensures 
that the KD-tree remains approximately balanced as at grows. 

The splitting hyper-planes dehne a partition of 0. Let the node box corresponding to 
a particular node be the (possibly unbounded) subset of 0 dehned through the constraint 
at each splitting hyper-plane on the journey from the root node to the node in question, as 
described in Section [2^ 

Consider, informally, for any node box, an ‘effective width’ along a particular co-ordinate 
axis to be some representative width such as the standard deviation of the posterior restricted 
to the node box. Suppose that the MCMC chain is currently in a node box where an effective 
width along its splitting co-ordinate is 6. Now, imagine that at each iteration the chain barely 
moves compared to 6, and each jump proposal - which can give a new evaluation of the log- 
likelihood even if the chain does not move - is also small compared to 6. In this case all 
of the new samples for a large number of iterations will descend to this one particular leaf 
node, which will hll up and then split. This split will not, however, be representative of 
the marginal median (in terms of the posterior) for the node box along the splitting axis. 
Hence, over the rest of the MCMC run, once the chain has moved on, there will be one side 
of the split which takes most of the future sample points and one side which takes very few 
of them, leading to an unbalanced tree. 
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2 b 

[0.4,0.6] 

[0.3,0.7] 

[0.2,0.8] 

d = 3 

d=10 

10 

0.49 

0.15 

0.02 

18.3, 14-23 (11-25) 

18.4, 14-23 (11-28) 

20 

0.35 

0.05 

0.002 

17.7, 15-21 (13-23) 

17.7, 15-21 (12-23) 

30 

0.26 

0.02 

0.0002 

17.5, 15-20 (14-22) 

17.5, 15-20 (14-21) 

40 

0.19 

0.007 

0.00001 

17.4, 15-19 (14-21) 

17.4, 15-19 (13-20) 


Table 1: Left: the probability that an estimated median from a sample of size 2b G 
{10,20,30,40} will be at a quantile outside of the range [0.4,0.6], [0.3,0.7] or [0.2,0.8]. 
Right: tree depths when 2b x 10® independent entries are added sequentially to a tree: mean 
over all leaf nodes, range of the central 99% of leaf nodes and the maximum and minimum. 


Now suppose that a preliminary run of tiq iterations has been carried out and that 
over this run the chain has been seen to mix reasonably across the posterior. Let us now 
construct a balanced tree from this run. Each leaf node will have b* (or b* + 1) entries, for 
some b* G {b,b + 1,... ,2b — 2}. Consider the node box of any specihc leaf node in this tree. 
Since the chain has mixed reasonably, it should represent the posterior within this node box, 
and in particular (1) in the component over which the leaf node will split, the median should 
be reasonably approximated, and (2) in the main MCMC run, the chain should also cover 
this box in approximately b* iterations or fewer; hence it will also cover any sub-divisions of 
the box in approximately b* (or fewer) iterations. 

The second reason a tree can become unbalanced is Monte Carlo error. A leaf splits after 
it has 2b entries by Ending the median of over the 2b entries, but this sample median 
will not be the true median over the node-box. Table [T] provides, for 4 difierent values of 
2b, the probability that the estimated median will be at a true quantile which is outside the 
range shown. A tree with 2b x 10® iid entries was simulated for each value of 2b and each 
of two dimensions. The table also shows the mean depth of the leaf nodes and the range of 
depths of 99% of the leaf nodes and, in brackets, of all leaf nodes. Both aspects of Table 
[U suggest that 2b = 20 or 2b = 30 should lead to a reasonably balanced tree with few leaf 
nodes requiring much more efiort to reach than the majority of the leaf nodes and with little 
efiect on the overall mean amount of efiort required to reach a leaf node. 
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3 Adaptive MCMC algorithm 


We briefly review the delayed-acceptance algorithm before describing our adaptive version. 
This is further extended to an adaptive pseudo-marginal version in Section 13.41 


3.1 Delayed-acceptance algorithms 


Given a current parameter value, 6 E Q, the Metropolis-Hastings (MH) algorithm proposes 
a new value, 6* from some density q{0*\6) and then accepts or rejects according to 

7r{e*)q{e\e*) 


0*) ;= 1 A 


'n'{0)q{6*\6) 


( 1 ) 


The delayed acceptance Metropolis-Hastings (daMH) algorithm utilises a cheap (deter¬ 
ministic or stochastic) approximation tTc in two stages. At Stage One, tTc is substituted for 
71 in the standard MH acceptance formula: 




77,(9*) q {9\9*) 
frMq{9*\9)^ 


( 2 ) 


A second accept/reject stage is applied to any proposals that pass Stage One and a proposal 
is only accepted if it passes both stages. The Stage Two acceptance probability is: 


MO, 9*) 


7i{9*)nc{9) 


(3) 


The overall acceptance probability, ai{9, 9*)a2{9, 9*) ensures that detailed balance is satished 
with respect to vr; however if a rejection occurs at Stage One then the expensive evaluation 
of 7i{9) at Stage Two is unnecessary. 


3.2 Adaptive, delayed-acceptance algorithm 


As in 


Roberts and Rosenthal! (120071. l2009l) our adaptive kernel consists of a mixture of a hxed 


kernel and an evolving kernel. At each iteration, with a user-dehned probability (3 G (0,1), 
the hxed kernel is selected, otherwise the adaptive kernel is used. Our hxed kernel is a 
standard Metropolis-Hasting kernel and our evolving kernel uses delayed-acceptance with 
an acceptance rate derived from a cheap approximation tt^*) which is an inverse-distance- 
weighted average of the expensive evaluations of the true posterior at the k nearest neighbours 
to 9* in the KD-tree at iteration n. After iteration n — 1 let there have been in-i evaluations 
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Our adaptive 


of the true posterior, vr. Let these evaluations be at values 6 **^,..., 
algorithm requires a sequence of probabilities, {pjjjgFj, with 

lim Pi = 0. (4) 

2—>-CX) 

In practice, the algorithm proceeds until some ritot iterations have been performed. 
Algorithm 1: adaptive-KD-tree, delayed-acceptance Metropolis-Hastings. 

1. With probability jS go to Step [2] (MH) else go to Step [3] (da-MH). 

2. MH: Propose 9* from q{d*\d). Evaluate the expensive posterior, vr(0*), and accept the 
proposal {6 ^ 9*) with probability given by ([I]); otherwise reject the proposal (6* ^ 0). 
Go to Step m 

3. da-MH; Propose 9* from q'{9*\9). 

(a) Stage 1: Evaluate ^a{9*) using the current KD-tree; with probability ai{9, 9*) as 
defined in ([2]) proceed to Step [3b] (Stage 2); otherwise reject the proposal {9 ^ 9), 
set in = in-i, go to next iteration. 

(b) Stage 2; Evaluate vr(6 **); accept the proposal {9 ■(— 9*) with probability a2,s{9, 9*) 
as defined in ([3|); otherwise reject the proposal (9^9). Go to StepjU 

4. Set in = in-i + 1; add {9*,n{9*)) to a list of recently-evaluated parameter/posterior 
pairs; with probability Pi^ transfer all pairs from this list to the KD-tree; go to next 
iteration. 

3.3 Delayed acceptance random walk Metropolis 

It remains to choose the proposal mechanisms, q and q'. The Random Walk Metropolis 
(RWM) is a MH algorithm where q{9*\9) = q{\\9* — ^H) for some suitable norm, and hence 
q{9*\9) and q{9\9*) cancel in the acceptance ratios ([ID and ([2D. We consider the standard 
choice of 

q{9*\9) = N{9*-9,V) and q'{9*\9) = q^{9*\9) := N{9*-9,^^V), (5) 

where N{-]9,V) denotes a multivariate Gaussian density with mean 9 and variance V and 
where V has been chosen so as to approximately optimise the efficiency of the standard 
RWM algorithm. 
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As we shall discover, the cheap approximation, tTc is reasonably accurate. As ^ increases 
from 1 the overall acceptance rate and, in particular, the Stage One acceptance rate, ai, 
can decrease quite substantially. If all computational expense is negligible except for the 
evaluation of the true posterior then for a given amount of computational effort, the number 
of evaluations of the expensive posterior remains approximately constant, although the total 
number of iterations of the algorithm increases in proportion to the reciprocal of ai. However, 
as each proposed jump is larger, moves which are accepted at Stage Two are typically larger. 
Thus, provided the Stage Two acceptance rate does not decrease too drastically, the mixing of 
the algorithm (in terms of movement per CPU second) can, and often doe s, actually increase 
for intermediate values of £ . This heuristic has been noted before (e.g. IChristen and Foxl. 


2005 


Banterle et ah 


2OI5I) and also applies for the delayed-acceptance pseudo-marginal 


RWM. A rigo rous analysis o 


is provided in 


Sherlock et ah 


the be haviour of these algorithms as a function of the scaling 

fcoibak 


3.4 Adaptive, delayed-acceptance, pseudo-marginal algorithm 

We hrst overview pseudo-marginal Metropolis-Hastings algorithms and then describe the 
adjustments to the set-up required for the pseudo-marginal version of our algorithm. The 
algorithm itself is provided in the supplementary material (Appendix C). 

The pseudo-marginal algorithm uses a non-negative stochastic estimator 7rs{0] Z) of the 
posterior vr(0), where Z is a collection of random variables whose distribution may, and 
usually does, depend on 6. Crucially, we require E [TtsiO] Z)] = C7r(d), where c is hxed and 
non-negative. We may therefore rewrite ^s(^) as 7r(0)W with lU G W C [0, cxo) sampled 
from some density qe{w), and 

poo 

E[1U]= / wqg{w) dw = c. (6) 

Jo 

The pseudo-marginal MH (PsMMH) algorithm is simply a Metropolis-Hastings Markov chain 
acting on the extended statespace 0 x W with a target of 

tt{6,w) =-7r{e)qg{w)w. (7) 

c 

This has the required marginal for 6^ by (|6]). Detailed balance is ensured with respect to this 
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target by setting the probability for {9*,w*) being accepted to: 


apM{[e,wi[e\w*]) 


ir(e>)q(e\e')w 

Tis{9)q{6*\6) 7r{6)q{6*\6)W 


( 8 ) 


When delayed acceptance is implemented with a pseudo-marginal framework, the Stage 
One acceptance probability is exactly as in ([2]). In Stage Two the true posterior in ([3]) is 
replaced with the realisation from the unbiased estimator: 




7ls{9*)7lc{0) 

uo)uo*y 


(9) 


The kernel P is now a hxed PsMMH kernel on 0 x W and {P-y}^gg is now a set of 
pseudo-marginal kernels on 0 x W. The common stationary density of all kernels is now 
given in Q, so, very importantly, all kernels use the same mechanism for generating the 
estimate ^s{9*) of the posterior at the proposed value for 9. 

The algorithm proceeds as for the non-pseudo-marginal version except that is substi¬ 
tuted for TT in ([1]) (hxed kernel), and ([3]) is replaced with ([9]) (evolving, DA kernel). Naturally, 
instead of storing evaluations of vr the KD-tree now stores realisations of the unbiased ap¬ 
proximation, TTg. 


3.5 Theory and guidance 

We show that, subject to conditions, our algorithms (with general proposals, q) are ergodic. 
We also provide guidance on choosing the probability of using the hxed kernel, (3, and on 
the number of iterations for which the algorithm should be run. 

Dehne o.mh{9,9*) as follows. For the adaptive KD-tree daMH algorithm of Section 
13.21 an4u{9. 9*) is the acceptance probability for the hxed kernel as given in ([1]). For the 
pseudo-marginal version of the algorithm in Section 13.41 it is the acceptance probability for 
an hypothetical, idealised version of the hxed, pseudo-marginal kernel where the posterior 
is known exactly, up to a hxed multiplicative constant: aMH{9,9*) = apM{[9,l],[9*,1]), 
where apM is dehned in ([8]). We require a minorisation condition and, for the daPsMMH 
algorithm, and additional assumption of uniformly bounded weights. These assumptions are 
discussed in in the supplementary material (Appendix D), where their main consequence. 
Theorem [1] is proved. 
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Assumption 1. There is a density and 5 > 0 such that q{9*\9)aMH{(^,(^*) > 6h'{9*) for 
all 9eQ. 


Assumption 2. The support for W is uniformly (in 9) bounded above by some w < oo. 


Theorem 1. Subject to Assumption 1 the adaptive KD-tree daMH algorithm of Section \3.^ 


is ergodic. The adaptive KD-tree daPsMMH algorithm of Section \3.4\ is ergodic subject to 
Assumptions [3 and [E 


Now consider the specific, scaled, proposal defined in ([5]), where increasing ^ decreases 
the Stage One acceptance rate, Oi. The algorithm may only accept a proposal after a 
computationally-expensive evaluation (of tt for daMH, or for daPsMMH). Thus, if the 
probability, /3, that the non-DA kernel will be chosen is unaltered, then as oi —)■ 0 nearly all 
of the expensive evaluations will be by the hxed, non-DA kernel, and the relative contribution 
from the DA kernels will unintentionally dwindle to zero. 

Decreasing f3 in proportion to ai would £x the fraction of all expensive evaluations that 
are by the DA kernel; however with a smaller f3 the chain can no longer be guaranteed to 
be as close to tt after the same, fixed number of iterations. Theorem 2, which is stated 
and proved in the supplementary material (Appendix D), shows that, with /3 oc ai, as ai 
decreases the total number of iterations, n, of the algorithm should be increased so as to 
maintain the expected total number of expensive evaluations of the posterior, E , and 
that E can be set so as to maintain any given upper bound on the total variation distance 
between the chain and vr whatever the value of ai. 

Fixing E approximately fixes the expected overall CPU cost. Furthermore, adapta¬ 
tion can only occur when the expensive posterior is evaluated, and it occurs with a fixed 
set of probabilities that depend on i and not on the iteration number. So, fixing E also 
approximately hxes the expected number of adaptation occurrences. 


4 Simulation Studies 

In this section we evaluate the empirical performance of the proposed da-PsMMH and da¬ 
MH algorithms by considering two examples based upon Markov jump processes (MJPs). 


15 





The first exam ple (Section 


teractions (e.g. 


Bovs et al. 


4. ID arises from the Lotka-Volterra system of predator-prey in- 


20081). Since the marginal likelihood is intractable, we apply the 


adaptive da-PsMMH scheme and compare its performance over a range of tuning parame¬ 


ter choices with that of an optimised PsMMH sche me. The second example 
arises from the autoregulatory network proposed by iGolightly and WilkinsonI 
the s ize and complexity of this system, a linear noise approximation (LNA) (Ivan Kampen . 


(Sect ion 14.2p 
200511. Given 


20011) (see also Appendix F of the supplementary material), of the corresponding Markov 


lump process is t aken to be the inferential model of interest. Following the algorithm of 
(120141) (see Appendix F.l), the marginal likelihood under this model is 


Fearnhead et al. 


tractable, but involves the solution of a system of 14 coupled ordinary differential equa¬ 
tions (ODEs), which can be time consuming. We therefore apply the da-MH algorithm and 
compare its performance to a simple MH scheme without delayed acceptance. 

Both MJPs are described through a set of r reactions between p different species, Ai,..., Xp. 
The hazard rate of each reaction depends on the current species numbers, Xi,... ,Xp via 
an assumption of mass-action kinetics with unknown reaction rate constants z/i,..., for 
further details reg arding the constr uction of MJP representations of reaction networks we 


refer the reader to IWilkinsonI fl2012l) . Tables E.l and E.2 in Appendix E.l list the reactions 
and associated hazards for each example. 


For the Lotka-Volterra model, the Gillespie a 


using parameter values taken from IWilkinsonI (1201211 . to generate a skeleton path comprising 


gorithm 


( Gillespi^ 


( 19771) ) was applied. 


51 values of Xt at integer times in the interval [0,50]. For the autoregulatory system, the 


LNA itself was used, with parameter values taken from 


Golightlv and WilkinsonI ( 201ll) . to 


generate two skeleton paths containing, respectively, 101 and 201 values of Xt at evenly- 
spaced times covering the intervals [0,100] and [0,1000]. All skeletons were then corrupted 
with Gaussian noise to form the data sets on which inference was performed: 


Yt\Xt = Xt ^N{xt,D), ( 10 ) 

where H is a diagonal matrix with diagonal entries af,... ,ap. In the autoregulatory ex¬ 
ample we refer to the data sets with 101 and 201 observations as Vi and V 2 respectively. 
Appendix E.l provides details of the initial conditions and parameter values for each simu¬ 
lation as well as for the variances of the corrupting Gaussian noises. 
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For inference, in both cases, for simplicity, the initial state of the system is hxed at its 
(known) true value. As all of the parameters are strictly positive we therefore consider a log¬ 
arithmic transformation so that the parameter vector of interest is 6* = (log(z/i),..., log(z/r), 
log(cri),..., log(crp)). In both examples, individual components of 6 for which inference is 
performed are given independent Uniform U{—8,8) priors. For the Lotka-Volterra model, 
r = 2 and p = 2 so that dim(6*) = 4. For the autoregulatory system, p = 4 and, as discussed 
in Appendix E.l, we perform inference on r = 6 rate constants, giving dim(6') = 10. 

Random-walk proposals of the form ([5]) are used for the fixed and the adaptive kernels 
in both examples. The fixed kernels used a proposal variance of U/jxed = where S is the 
sample variance from an initial, pilot run using a fixed, non-delayed-acceptance kernel, and 
A is chosen to optimise the efficiency of the hxed kernel. The corresponding adaptive kernels 
use a proposal variance of ^^Vfixed, with .^ > 0 a tuning parameter. 

In each example we take the probability of adding to the tree at iteration n to be 

Pin = + Cin)~^ ■ (11) 

For each example, the computational cost of an evaluation of the expensive stochastic ap¬ 
proximation was estimated from the initial training run, and this provided an estimate of the 
number of expensive evaluations, i, that would ht within the computational budget. Using 
the guidelines in Section 2.6, the parameter e was chosen with a desire that points still be as 
likely as not to be added to the tree after i/2 evaluations had already been added. For both 
systems this had the effect of limiting the overall tree size to around four times the size of 
the initial training set. 


4.1 Discretely observed Markov jump process 


To implement the adaptive da-PsMM H scheme descr i bed in Algorithm 1, we used a boot¬ 


strap particle hlter with m particles (lAndrieu et ah 


20101) to obtain each value of Tig 


Both m and Vfixed were chosen so as to optimise the efficiency of the hxed kernel; see the 
supplementary material (Appendix E.2). We initialised the KD-tree using the hrst 10^ eval¬ 
uations of the expensive posterior in the pilot run. For all experiments, we assumed a hxed 
computational budget of 10^ seconds, which equated to approximately i = 40000 evaluations 
of the expensive posterior and, thus, a maximum tree size of hve times the initial size. 
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To assess the effect of scaling ^ on the overall and relative (to PsMMH) efficiency of 
da-PSMMH we fixed the number of leaf nodes in the KD-tree upon splitting to be 2b = 20, 
the number of nearest neighbours to be /c = 6 = 10 and took the parameter controlling 
the rate of adaptation to be c = 0.001 to give around a 5% chance of adaptation after half 
the computational budget. We then ran the algorithm for values of ^ G [1,4], following 
the practical advice of Section 1X51 bv choosing oc di with {3 = 0.05 for ^ = 1. Firstly, 
the sampled posterior values are consistent with the ground truth parameter values that 
produced the data (see Figure G.l in the supplementary material for the marginal posterior 
distributions for a typical run). Figure [T] shows the effect of scaling on minimum effective 
sample size (mESS) over each parameter chain relative to that obtained under an optimally 
tuned PsMMH scheme (with an acceptance rate of 9.9% and an mESS of 528) and the effect 
of scaling on the Stage 1 acceptance probability. The (scaled) values of (3 used for each run 
are also shown. Figure [U suggests that for the values of ^ considered, ^ = 3 is optimal in 
terms of mESS and gives an improvement on overall efficiency over PsMMH of a factor of 
6.8. Even simply taking the same scaling as PsMMH still gives a 3-fold increase in efficiency 
of da-PsMMH over PsMMH. 

To assess the effect of adaptation on the performance of the algorithm we fixed ^ = 3, k = 
5 = 10 and performed runs with c G:= {0.0001, 0.001, 0.01, cxd} with c = oo representing no 
adaptation. Table [2] summarises our findings. At i/2 iterations, these values of c correspond 
to an expected number of expensive evaluations before adaptation occurs of approximately 
(3, 20, 200, oo}, respectively. The larger the pause between adaptations, the less accurate 
the tree is between adaptations, and while 200 new evaluations is small compared with 10000 
or more existing evaluations, it must be remembered that the most recent evaluations will 
be from a similar part of the state space to the current position and so will be among the 
most relevant. The reduction in accuracy especially in new, low-density regions, increases 
the Stage 1 acceptance rate and decreases the Stage 2 acceptance rate giving an overall 
reduction in statistical efficiency. Moreover, the increase in the Stage 1 rate results in a 
larger number of expensive posterior evaluations. 

With c = oo the algorithm runs with no adaptation and just uses the initial training set 
of 10, 000 posterior evaluations. In this case (last row of Table [2]) performance is better than 
for the simple RWM algorithm but worse than for all of the cases that allow adaptation. 
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providing clear evidence of the importance of adaptation for our algorithm. Further, the 
more slowly pi I oo, the more efficient the algorithm. 

We also explore the sensitivity of our method to the choice of the number of leaf nodes in 
the KD-tree upon splitting {2b) and the number of nearest neighbours (fc). We fixed ^ = 3, 
c = 0.001 and took 2b G {4,10,20,30} and k G {2,5,10,15}. Table G.3 in Appendix G 
shows empirical performance for each (/c, b) combination considered. Gonsistent with our 
findings in Section 12.71 increasing b increases the efficiency until 2b = 20, but there is little 
difference when moving from 2b = 20 to 2b = 30. Fixing 2b and varying k suggests that 
fc = 5 is optimal in terms of mESS. Further discussion can be found in the supplementary 
material (Appendix G). Finally we examine the gain in overall efficiency by using a KD-tree 
as a storage and look-up method over simply storing posterior evaluations in a list. Running 
the da-PsMMH scheme with the optimal values (.^ = 3, c = 0.001 and k = 5) gave an niESS 
of 2009. Thus, in this example, using a KD-tree increases overall efficiency over a naive 
approach by a factor of 1.9. Naturally, increasing the computational budget (and therefore 
the number of posterior evaluations to be stored) will increase the advantage of the KD-tree. 


c 

Tree Size 

Mean depth 

depth range 

di 

a2 

mESS 

Rel. mESS 

0.0001 

41078 

11.82 

10-14 

0.00772 

0.339 

3845 

7.28 

0.001 

40256 

11.79 

10-14 

0.00915 

0.276 

3591 

6.80 

0.01 

43248 

12.04 

10-15 

0.0121 

0.204 

2464 

4.67 

oo 

10000 

9.69 

9-10 

0.0175 

0.136 

1829 

3.46 


Table 2: Effect of rate of adaptation c. Final tree size, mean leaf node depth, depth range, 
empirical stage 1 and 2 acceptance rate, minimum effective sample size (mESS) and relative 
mESS. 


4.2 Discretely observed ODE system 

The LNA gives a Gaussian model for Xt which when coupled with the Gaussian observa¬ 
tion model above, permits a tractable form for the marginal likelihood which we denote 
by 7r(2/i:„|6*). An algorithm for evaluating the marginal likelihood, and therefore the pos¬ 
terior (up to proportionality) under the LNA, can be found in the supplementary material 
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e e 

Figure 1: Left panel. Minimum effective sample size (mESS) relative to optimised PsMMH, 
against scaling. Right panel. Empirical stage 1 acceptance probability di against scaling. 
The points represent di(,^ = = 1) and show that was scaled in proportion 

to ai(0. 

(Appendix F.l). Executing one iteration of the algorithm requires calculation of a full nu¬ 
merical solution of the ODE system (F.l) over [0,100] or [0,1000]. Our implementation 
uses standard routines from the GNU scientihc library, specihcally the explicit embedded 
Runge-Kutta-Fehlberg (4, 5) method. We limit the computational cost of these calculations 
by applying the da-MH scheme. 

The pilot run for each dataset was of 3 x 10^ iterations. We initialised the KD-tree with 
all 3 X 10^ evaluations of the expensive posterior obtained from the initial pilot run. For all 
experiments, we assumed a hxed computational budget of 5 x 10^ seconds, which equated 
to approximately i = 120000 evaluations of the expensive posterior and, thus, a maximum 
tree size of hve times the initial size. Following the Endings of Section 14.11 we initially set 
the adaptation rate to c = 0.001, and set 2b = 20 and k = 5. 

Firstly, Figure G.3 in Appendix G shows that the sampled parameter values are consis- 
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Algorithm 

e 

c 

0^1 

&2 

mESS 

Rel. mESS 



Pi (101 obs. on 

[0,100]) 


MMH 

1.0 

- 

0.225 

1.000 

2383 

1.00 

da-MMH 

1.5 

0.0001 

0.141 

0.482 

7699 

3.23 


1.5 

0.001 

0.131 

0.480 

7638 

3.21 


1.5 

00 

0.171 

0.366 

4530 

1.90 



P 2 (201 obs. on 

0,1000]) 

MMH 

1.0 

- 

0.2291 

1.000 

632 

1.00 

da-MMH 

2.0 

0.0001 

0.0455 

0.394 

2996 

4.74 


2.0 

0.001 

0.0390 

0.338 

2779 

4.40 


2.0 

00 

0.0978 

0.165 

1485 

2.35 


Table 3: Algorithm, optimal scaling (,^), adaptation rate parameter (c), empirical stage 1 
(di) and 2 ( 02 ) acceptance rate, minimnm effective sample size (mESS) and relative mESS. 


tent with the ground truth, with a decrease in uncertainty when using more observations. 
With dataset V 2 the LNA equations must be solved over a longer time period than for Pi, 
and the steps of the algorithm for calculating the marginal likelihood (in Appendix F.l) 
must be executed twice as many times. Consequently Figure [2] (left panel) shows that the 
minimum effective sample size (mESS) obtained under da-MH is smaller when using dataset 
V 2 . However, mESS relative to the same quantity under MH is increased when using P 2 , 
since the cost of evaluating the KD-tree is unchanged (for a fixed tree size). The optimal 
scaling ^ (for the values considered) for each scheme is reported in Table [3l We also report 
output of additional runs with c G:= {0.0001, 0.001, 00 }. An optimally tuned da-MH scheme 
(with c = 0.001) gives an increase in overall efficiency of a factor of 3.2 when using Pi and 
4.4 when using P 2 . When c = 00 (representing no adaptation), we see an increase in Stage 1 
acceptance rate and a decrease at Stage 2. The resulting decrease in empirical performance 
provides further evidence of the importance of adaptation. Finally, we again note that the 
algorithm performs best with a very low value of c whilst still providing posterior output 
consistent with c = 0.001 (results not shown). 
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5 e 

Figure 2: Left panel. Minimum effective sample size (mESS) from da-MH against scaling. 
Right panel. Minimum effective sample size from da-MH relative to the same quantity from 
optimised MH, against scaling. For each panel, the solid and dashed lines indicate output 
using datasets Vi and V 2 respectively. 

5 Discussion 

We have presented standard and pseudo-marginal versions of an adaptive, delayed-acceptance 
random walk Metropolis algorithm. The delayed-acceptance (DA) step is generic, estimating 
the posterior using an inverse-distance-weighted average of the fc-nearest previous evaluations 
of the posterior, and the search for these neighbours is made fast by storing a subset of pre¬ 
vious evaluations in a customised version of a KD-tree. The kernel is a mixture of a hxed 
(non-adaptive, non-DA) kernel and an adaptive DA kernel. Easy-to-use C code for creating 
a KD-tree and storing and retrieving values from it is provided alongside this article. 

We have shown that our algorithm is ergodic, subject to conditions. Furthermore, as the 
scaling of the RWM proposal in the DA kernel is increased, the probability of choosing the 
hxed kernel should be decreased in proportion to the Stage One acceptance rate of the DA 
kernel and the total number of iterations should be scaled so that the expected number of 
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evaluations of the expensive posterior remains constant or, equivalently, so that the total 
computational budget remains hxed. 

Pseudo-marginal and non-pseudo-marginal versions of the methodology were applied, 
respectively, to synthetic data generated from a discretely observed Lotka-Volterra system 
and an ODE model of species dynamics in an autoregulatory gene network. In these two 
examples, our proposed scheme outperforms the standard scheme by factors of approximately 
7 and 4 for, respectively. 

In both of our examples the algorithm was more efficient the more slowly the adaptation 
probability approached zero, suggesting that it might be best to simply add each new eval¬ 
uation of the expensive posterior directly to the KD-tree directly rather than storing them 
in a queue. Forcing the limit of the adaptation probability to be 0 ensures the diminishing 
adaptation condition (see, e.g.. Theorem 5) is satisfied; diminishing adaptation is itself one 
of the key conditions required for ergodicity and the concern would be that by removing 
this direct constraint the algorithm would no longer be ergodic. Whilst this seems likely to 
be the case in general. Theorem 3 in the supplementary material (Appendix D.5.1) shows 
that for a variation of Algorithm 1 on a compact state space it is possible to adapt after (at 
worst) every other expensive iteration and still remain ergodic. 

We have focussed on making the cheap approximation to the posterior adaptive by inter¬ 
mittently updating the KD-tree. The covariance matrix of our random walk proposal could 
have been made adaptive by intermittently updating the cov ariance matrix of the random 


walk proposal so that it uses all entries in the chain to date fe.g. [Roberts and Rosenthal! . 


2009 


Sherlock et af 


20101) . This could even be made ‘local’, using only the /c-nearest neighbours 


in the KD-tree. It might also be possible to adaptively update the scaling of the proposal; 


however the mechani s m to use is les s obvious since, unlike in (e.g. 


Sherlock et al. 


2010 


Viholal. 


Andrieu and Thorns! . 


2008 


20121) . there is no single optimal acceptance rate for our algo¬ 


rithm. Such adaptations would be a distraction to our main innovation and have not been 
implemented. 

As we were revising this article, iKostov and Whitelev! (120161) proposed an algorithm for 


estimating the variance of the estimator of the likelihood that comes from the particle filter. 
Although we do not pursue it here, this opens up the possibility of weighting each estimate 
of the likelihood according to the reciprocal of its variance as well as its distance from the 
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proposed 6*. 


5.1 Alternatives to k-nearest neighbours 


As mentioned in Section [ll there are similariti es between the non-pseudo-marginal version of 


our algorithm and that of 


Conrad et ah 


(120141) (henceforth denoted CMPS). Here we discuss 


the approach of CMPS, highlighting both similarities to and differences from our algorithm. 
We also consider a general GP alternative to our k-NN implementation. 

CMPS fit local linear-, quadratic- and Gaussian-process (GP)-based models in neigh¬ 
bourhoods of candidate values, and at any given point in time their algorithm targets the 
approximate posterior rather than the true posterior. However, the accuracy of the approxi¬ 
mation is continually assessed and improved by carefully choosing further local points as the 
algorithm proceeds so that, asymptotically, the algorithm targets the true posterior distri¬ 
bution. Between adaptations our algorithm targets the true posterior distribution, but it is 
perturbed every time an adaptation occurs. Thus our algorithm also asymptotically targets 
the true posterior distribution. Both algorithms also use an increasing set of evaluations 
of the true posterior. CMPS chooses the next evaluation of the true posterior by design 
whereas our algorithm is “opportunistic” and potentially adds the value at each new point 
evaluated; however it is an intelligent opportunist in the detail of how it deals with new 
points which are very close to existing points. 

Our algorithm could be extended to htting local planes, quadratics or CPs to the k- 
nearest neighbours in a similar manner to CMPS, however the weighted average is simpler, 
quicker, and for it to be a sensible approach to take it only requires the posterior to be 
bounded and continuous, rather than needing additional constraints. When it is the exact 
log-likelihood that we are approximating these more complex local models might lead to an 
improvement in the accuracy of the surrogate, at the small expense of possibly having to 
use more nearest neighbours to £t. In the pseudo-marginal case, however, the efficiency will 
depend on the variance in the log-likelihood estimates and the true changes in log-likelihood. 
If the former outweighs the latter then there is little to be gained as the extra computation 
may not lead to additional accuracy. The cost of re-estimating the GP hyper-parameters 
as new training data become available is discussed in the final paragraph of this section. 
One hnal difference between the two approaches is that CMPS focusses on the non-pseudo- 
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marginal case whereas our algorithm is applied in both pseudo-marginal and non-pseudo 
marginal settings. 

Our likelihood estimate at a proposed point, 6^', is a weighted average of the likelihoods at 
the k-nearest neighbours. An alternative approach would be to use a Gaussian process (GP) 
htted to the set of {9,\ogp{y\9)) pairs currently in the kd-tree. GPs hav e be en used to ap¬ 


proxim ate the log-likelihood previously. For example in iRasmussenI (120031) and iFielding et ah 


fl201lh a GP provides a cheap surrog ate for Hamiltoni an Monte Garlo calculations. Altern- 
tively, in the context of ABG-MGMG, I Wilkinson! (120141) uses a Gaussian process to mode l the 
logarithm of the ABG approximate likelihood function, while iMeeds and Wellingl (120141) ap¬ 
proximate the joint synthetic likelihood at the current and proposed point using independent 
GPs for each summary statistic. 

As with our approach, the final point estimate from a GP is a weighted average of 
existing values. However, by estimating the parameters of the GP one may represent the 
scales of variability more accurately and so obtain more accurate point estimates than with 
our inverse-distance weighting approach. The GP model also supplies an estimate of the 
uncertai nty in the point estim ate of the log-likelihood. 


As in 


Gonrad et ah 


(120141) . the estimate of uncertainty allows for a choice of new training 


points to minimise the variance in some region o f interest, with th e pot ential of a large re 


Wilkinson 

(2014 

) and 

Meeds and Welling 


(120141) which use this approach target an approximation to the true posterior and an accu¬ 


rate GP approximation is essential for the algorithm to be useful. By design, our algorithm 
overlays a standard MH or PMMH algorithm and it automatically targets the true poste¬ 
rior; an inaccurate approximation reduces the mixing efficiency but does not invalidate the 
algorithm. Again, since our algorithm overlays a standard MH algorithm the true likelihood 
(or an unbiased estimate) must be evaluated at every accepted point and it is not imme¬ 
diately obvious how estimates of uncertainty might be used to circuvent this requirement 
while maintaining the true posterior as the target. 

Finally, estimating the hyper-parameters of a GP is computationally very costly, and this 
estimation should be repeated as the training data set grows. For this reason and because 
of the difficulty in identifying d{d + l)/2 kernel range parameters, GP methods often use a 
diagonal covariance structure for the kernel, limiting the flexibility. Our approach of using 
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the covariance matrix of the sample from an initial training run to provide a map to a 
new parameter space where Euclidean distance is appropriate has a similar flavour to the 


pragmatic approach of using an initial tra ining samp 


e to fi 


then keeping them hxed; alternatively, see 


Shen et al 


the GP hyper-parameters and 


fj2006[l for a partial solution. 
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Supplementary material for Adaptive, 
delayed-acceptance MCMC for targets with 

expensive likelihoods 


A Operations on the KD-tree 

In this appendix we describe the standard operations of creating a balanced KD-tree from a 
dataset and of searching throngh the KD-tree for the K-nearest neighbours. 

A.l Creating a balanced KD-tree from an initial dataset 

Here we suppose the availability of Uq pairs of a co-ordinate vector 9, and a vector of interest, 
V. The algorithm for creating the tree proceeds recursively until the number of leaves in the 
node is less than 2b. 

The recursive function commences at the root node with dspiu = 1, and a data set D, 
which is all of the data, so that n = \'D\ = uq. 

1. If n < 26 then hnish this recursion; this node is a leaf node and contains data T) and 
splitting component dspiu- 

2. Otherwise n > 26, so hnd the median fidsput of fh® 1^1 scalar values for 

3. Place all points for which Od^^^ < into a set C. 

4. Place all points for which Od^^^ > into a set 71. 

5. If any points have = Md^put then use independent Bernoulli trials with probability 
0.5 to place each in £ or 7Z. 

6. Create a left and a right child for this node. 

7. Recursively call this function for the left child using V = C and dspiu = dgpiu © 1- 

8. Recursively call this function for the right child using V = 71 and dspiu = dspiu © 1- 
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A.2 Finding the k nearest neighbours of a point 

The algorithm below hnds the k nearest neighbours to a point 9*. For simplicity it, assumes 
that /c < fe; it is straightforward (but messy) to alter it to allow any k. 

Stage One 

1. Descend from the root node to find the leaf node to which 6* would be added if we 
wished to add a new entry {9*, Iq*). 

2. Find and store an ordered list of the k points on this leaf that are nearest to 9*. Keep 
track of the distance of the k^^ furthest point from 9*. 


Stage Two is only applied if the leaf note reached by Stage One has a parent; this is the case 
in all but the trivial tree. 

Stage Two: This part of the algorithm is initialised with a list of k nearest neighbours 
and a value Tmax, both obtained from Stage One. It uses an [ascending/descending] flag 
and commences at the parent of the leaf node reached by Stage One with the flag set to 
ascending. To aid the linguistic flow we use the verbs ascend and descend as shorthand 
for calling Stage Two with the flag set to ascending and descending respectively. If the 
current node is a branch, let 9spiit and dsput refer to the current node. 


1. If ascending: 


m If 


OspHt - 


split 


< fmax then set u = where e* is the k-vector with all 

elements set to zero except for the zth, which is set to one, then descend to the 
child (of this node) from which the algorithm has not just ascended. 


(ii) If 


split 


9* 


split 


^ f'max and if this node has a parent then ascend to this parent. 


2. If descending and the current node is a leaf then calculate the distance from every 
point on the leaf to 6 * and sort the leaves according to this distance. If any are within 
i^max then update the list of the k nearest neighbours and update r^ax- 


3. If descending and the current node is a branch then if 9*^ < 9spin define the near 

child to be the left child, otherwise define it to be the right child. Similarly, define the 
far child to be whichever child is not the near child. 
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(i) Descend to the near child with u unchanged. 

(ii) Update to Ogpiu, keeping the other elements of u unchanged. If ||u|| < r^ax 

then descend to the far child. 


B Proofs of Propositions 1 and 2 


Proof of Proposition 1 


Let B* and Bl be the e balls around 9* and 6*j respectively and denote the empty set by 0. 
Firstly, since 9i {i = 1,... ,n) are identically distributed, 

1 - P(01 ^ 5;,... ^ B:) = P(0i e B: or.. .or e B,) < nF {9, G B,) = 


Hence Pkeep ^ 1 

Secondly, requiring that one point be outside of the e ball of another point is equivalent 
to requiring that the e/2 balls of both points do not intersect. So, whatever the relationship 
between 9i,, 0j_i, 


p {9i i b:\9^ ^ h:, ..., ^.-1 i Bl) = p {b\i^ n i?:/2 = ^\b\i^ n i?:/2 

— ^ (-^e/2 = 0) 

= F{9,iBX) 




since the conditioned event only rules out portions of M.'^\B *^2 Hence 

n 

Pkeep = P (01 ^ B:)1[F{9, i Bl\9^ ^ H/,..., i Bl) 
i=2 


< P (01 ^ B*)^ = (1 - En,e/n)^ < 


Proof of Proposition 2 

In this case the distance, D between any two points chosen independently satisfies iD^/2 ~ Xd- 
The 0j, {i = 1,... ,d) are not mutually independent since none can be within e of any other, 
but this is irrelevant to the calculation that follows. Let Nn,e ■= then 

n 

En,e = 5^P(0i G B{9,,e)) = nE^2{ey2), 

i=l 

as required. 
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C Adaptive-KD-tree, delayed-acceptance pseudo-marginal 
algorithm 

Algorithm 2: adaptive-KD-tree, delayed-acceptance, pseudo-marginal Metropolis-Hastings. 

At the start of iteration n, let the current parameter value be 6 and let the cheap ap¬ 
proximation to the posterior be 7ra{6) with the expensive unbiased estimate of the posterior 
denoted by ffsiO). 

1. With probability (3 go to Step [2] (PsMMH) else go to Step [3] (da-PsMMH). 

2. PsMMH Propose 6* from q{6*\6). Evaluate T^s{d*) by effectively proposing W* from 
q{W*\6*). Accept the proposal {6 ^ 9*) with probability given by (9); otherwise reject 
the proposal {9-^9). Go to SteplH 

3. da-PsMMH Propose 9* from q'{9*\9). 

(a) Stage 1: Evaluate T^a{9*) using the current KD-tree; with probability ai{9, 9*) as 
defined in (2) (but with q' instead of q) proceed to Step [3b] (Stage 2); otherwise 
reject the proposal (9^9), set = i„_i and go to next iteration. 

(b) Stage 2: Evaluate 7rs{9*) by, effectively, proposing W* from q(W*\9*). Accept 

the proposal {9 <(— 9*) with probability a2,s{d,9*) as dehned in (10); otherwise 
reject the proposal (0 6*). Go to SteplH 

4. Set in = in-i + 1; add {9*,ns{9*)) to a list of recently evaluated parameter/posterior 
pairs; with probability pi^ add all pairs from this list to the KD-tree and then remove 
all pairs from the list; go to next iteration. 

D Theorems 1, 2 and 3: discussion, statement and 
proofs 

We discuss Assumptions 1 and 2, then place our algorithms in the more general framework 
in terms of which our proof of ergodicity is phrased. Next, we state and discuss Theorem 
2 and then show, as Theorem 3, that it is not always necessary to force the adaptation 
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probabilities, Pi to tend to zero. Finally we prove Theorems 1, 2 and 3. For simplicity, for 
Theorems 1 and 2, we consider only the daPsMMH algorithm since the daMH is a special 
case of daPsMMH with W = W* = 1. 


D.l Discussion of Assumptions 1 and 2 


Roberts and Rosenthal! (120041 ) show that any Metropolis-Hastings kernel on a compact state- 


space satisfies Assumption 1 provided 7i{9) is continuous and q{0*\9) is continuous and posi- 


tive. In the apDlications of interest to us all o ’ 

;he parameters are positive and it is common 

practice (e. 

R- 

Go 

iehtlv and Wilkinson. 

2005: 

Picchini. 

2014: 

Koblents and Mieuez. 

2015; 

Owen et ah 


>015 

) to place a vague but proper uniform prior on the logarithm of each pa- 


rameter; furthermore, tt is continuous. We may therefore choose q so that Assumption 1 is 
satished. If 0 is not compact then Assumption 1 still holds, for example, if the proposal 
density q{9*\9) = q{9*) is an independence proposal which satishes q{9*) > 67r{9*) since then 
q{9*\9)a{9, 9*) > q{9*) A 67r{9*) = 67r{9*). 

Whether or not Assumption 2 is satished depends upon the exact pseudo-marginal 
method used. Suppose, as in our examples, that an unbiased estimate of the likelihood, 
P{y\9), (and hence, up to a constant, of the posterior) is obtained using a bootstrap particle 
hlter. This provides P{y\9) as a hnite product of Monte Carlo averages of likelihood terms, 
where each of these terms arises from an assumed distribution for the error in the observation 
of some stochastic process Zt. Suppose, for example, that Yt\zt ~ N{zt, 9^) and that the state 
space for Zt is bounded, so that 0 < at < {yt — ZtY <bt < oo. Then loghF < {pt — at)/{291) 
and Assumption 2 will therefore hold provided the support for 9i is bounded away from zero. 


D.2 More general set up 

We set up the algorithm in more general terms since, in addition to the specific algorithms we 
have described, our ergodicity result. Theorem 1, is applicable to a general class of adaptive 
pseudo-marginal algorithms and, to our knowledge, is the first such result. 

Let F be a non-adaptive MH kernel on 0 with proposal q{9*\9), and let {P-y}'yeg be a 
(usually infinite) set of kernels on 0, where Fy has proposal q^{9*\9). All kernels are assumed 
to have the same stationary density, 7r(6*). At iteration n the adaptive kernel that will be used 
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with probability 1 — /9 is and, since the Markov chain is adaptive, let 7 „ be a realisation 
of the random variable r„, which depends on the history of the chain. At iteration n, our 
kernel is 

P;^ =(3P+{1- for some /3 e ( 0 , 1 ). (D.l) 

In our case, each (7 G G) uses the same initial proposal q'{9*\9), and this proposal will 
typically differ from the proposal, q{9*\9) used in the hxed kernel. However, our proof of 
ergodicity applies in the more general set up of fID.lj) . 


D.3 Theorem [2]: choice of n and {3 as functions of a\ 

Consider a collection of kernels, indexed by ^ G S, each a mixture of an adaptive kernel, 
indexed by and a common fixed kernel: 

P< = + (1 _ for some 3 G (0,1) and 7 „ G G^. (D.2) 

Let P use a proposal q, such as that in ( 6 ). For a given value of p all P|, 7 G use the 
same proposal, gC this could be the RWM proposal defined in ( 6 ), for instance. 


Theorem 2. Consider a set of adaptive daPsMRWM Algorithms indexed by p as described 
in and around flD.2|) . Let the common fixed kernel, P satisfy Assumption 1 and let all of the 
adaptive kernels have the same stationary density as P, given in (8); in particular, all use 
the same mechanism for generating W which satisfies Assumption 2. In addition to (5), let 
{PijieN be a strictly decreasing sequence. For each p after n iterations let be the number 
of evaluations of the expensive posterior and let be the average Stage One acceptance 
rate. Assume that given e > 0 3 icmde such that for all n with E \lO > icmde o>nd all f. 


P < 2 ^) > 1 - e 


For some fixed k > 0, we then set 

/3^ = nxP (D-3) 

Subject to the above E ^00 as n ^ 00 . Further, for any e > 0 and all f E E it is 
possible to choose a single E > 0 so that if E > E, for some n, then the TVD between 
the Markov chain and tt after n iterations is bounded by e. 
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Thus, provided that after a certain number of expensive iterations the acceptance rate is 
typically no more than a constant multiple of the long term average, it is safe to set the 
probability of choosing the fixed kernel proportional to a^. 

As the chain progresses and the number of expensive evaluations of the posterior increases, 
the representation of the posterior by the KD-tree improves and ai^n initially increases (with 
a typical relative change of around 5-10%) before settling down. In practice, we hnd the 
condition on the convergence of the Stage 1 acceptance probabilities to be a reasonable 
assumption. 

D.4 Diminishing probabilities of adaptation 

After every expensive iteration. Algorithms 1 and 2 store the newly-evaluated expensive 
posterior; they ensure that the diminishing-adaptation condition (see Theorem [5]) is satisfied 
by, after the Ah expensive iteration, adding all stored expensive posteriors to the KD-tree 
with probability Pi —?■ 0. Tables 2 and 3 suggest that the algorithm is more efficient the 
more slowly pi J, 0. The intuition behind this, and the possible consequences for higher¬ 
dimensional systems if Pi were to remain bounded away from 0 are discussed in Section 5; 
the concern is that such an algorithm might not be ergodic. In Appendix ID. 5. 31 we prove 
the following. 

Theorem 3. Let 7i{6) be a continuous density with respect to Lebesgue measure on a hyper- 
rectangular state space, 0, and consider an adaptive, delayed-acceptance MH algorithm the 
same as Algorithm 1 but with the following alterations: Vi G N, p 2 i = 1 and P 2 i+i = 0, 
and we remove the restriction on the growth of the tree described in Section 2.6. If the 
kernel satisfies the minorisation condition. Assumption 1, then the diminishing adaptation 
condition is automatically satisfied. 

D.5 Proofs of Theorems 1, 2, and 3 

Throughout this section the total variation distance (TVD) between any two probability 
measures, z/(-) and vr(-), is denoted ||z/ —7r||. We require the following definition and two 
results. 
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Definition 1. Consider a Markov kernel P on a statespace X. A snbset C* C A is small if 
there exists a positive integer no, e > 0, and a probability measure n(-) on X such that the 
following minorisation condition is satisfied 




(D.4) 


Theorem 4. /(Roberts and Rosenthal \2004) Consider a Markov chain with invariant proba¬ 


bility distribution 7r(-). Suppose that the entire statespace is small (i.e. flD.411 is satisfied with 
C = X). Then the chain is uniformly ergodic, and in fact \ \P^(x, ■) — 7r(-)|| < (1 — 


Theorem 5. ((R.oberts and R.osenthal. \200li) Consider an adaptive MCMC algorithm on a 


statespace X with adaptive kernels Py, 7 G and with 7 r(-) stationary for each P^; 7 G 
Under the following conditions the adaptive algorithm is ergodic. 


1. (Simultaneous uniform ergodicity) For all e > 0, there is N = N{e) G N such that 
||P^(a:, •) - 7r(-)||^^ < e for all x e X and 7 G 

2. (Diminishing adaptation) For any Xq = Xq, Tq = 70 , 

sup||Pr„+i(a:,-)-^r„(p-)||Ty ^ 

xGX 

where the convergence in probability is with respect to the distribution of r„ and r^+i 
given xq and 70 . 


D.5.1 Proof of Theorem 1 

We now show that a class of adaptive Metropolis-Hastings algorithms, which includes our 
algorithms, is ergodic subject to Assumptions 1 and 2. In line with our KD-tree approxima¬ 
tion, we assume that the target, tt, and the proposals, g, q' and qg (6* G 0), are all densities 
with respect to Lebesgue measure. 

Any delayed-acceptance algorithm is simply an accept/reject Markov chain with a non¬ 
standard acceptance probability. The key point is that detailed balance is preserved since 
(for the pseudo-marginal version) 

7r{e)qg{w)w q{e*\e)qg*{w*) ai{e,e*)a2,PM{[0,w], 

= 7rfie)q(,9*\9)ai{9,9*) x qg{w)qg*{w*) X 02 ,pm{[9,w], [6>*,m;*]), 

'^c[9) 
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and each of the three terms in the product is invariant to {0,w) -H- {6*,w*). We therefore 
prove ergodicity (subject to conditions) for any adaptive pseudo-marginal algorithm of the 
form given in and above fID.lj) . 

We require the following components. Here A and A denote any (Lebesgue) measurable 
subsets of 0 and 0 x W respectively, and 6 represents the Dirac delta function. 

1. A hxed pseudo-marginal kernel on 0 x W with stationary density 7r{d)qg{w)w: 

P (^[6', w], = (1 - apM ([6*, w])) 6a {[0, w]) 

+ [ d9* dw* q{9*\9)qe*{w*)apM{[0,w],[9*,w*]) . 

JA 

Here apM{\9,w]) is the acceptance probability from the current value: 

apM{[9,w\) = f d9* dw* q{9*\9)qe*{w*)apM{[9,w], 

Jbxw 

2. The corresponding hxed ‘ideal’ kernel on 0 with stationary density vr(0), 

P{9,A) = {l-aMH{9))6A{9)+ [ d9* q{9*\9)aMH{9,9*), 

JA 

from which we are unable to sample because 7r{9) and vr(6**) are needed in order to 
evaluate omh- Here aMH{9) = /q d6'* q{9*\9)aMH{9^9*). 

3. A set of additional (pseudo-marginal) kernels on 0 x W: {P^{\9, w], •)} 7 eg 5 as described 
in and above (ID.ID . and with the same stationary density as P. 

4. A sequence of probabilities satisfying (5). 

The generic algorithm is then: 

Algorithm 2b: generic, adaptive, pseudo-marginal, propose and accept/reject algorithm. 
Iteration n commences with current value and kernel index 7 „ and involves 

the following two steps. 

1. Sample from P*^ as dehned in (ID.ip . 

2. With probability Pn update 7 „ (i.e. update the adaptive kernel) by including all relevant 
information obtained since the kernel was last updated to create a new kernel. 
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Theorem 6 . Subject to Assumptions 1 and 2, Algorithm 2b is ergodic. 


Proof The condition (5) ensnres that the diminishing adaptation condition of Theorem 
[5] is satished. 

We next show that snbject to Assnmption 1 , P satishes a similar condition to As- 
snmption 1 and hence so does each of the kernels in flD.l|) . This is then shown to ensnre 
simnltaneous nniform ergodicity. 

We hrst dehne 

:= o{9)-qe{w)w, 
c 

which is a density by (7); we refer to the corresponding measure as h(-). From (ID.61) . 

w* c5 

q{9*\9)q04w*)apM {[9,w],[9*,w*]) > q{9*\9)q0*{w*)—aMH{O,9*) = —i>{9*,w*). 

This implies that for any 7 and any measurable set A G 0 x [0 ,mJ], 

Hence, the entire-statespace (0 x [0 ,mJ]) is small with no = 1 and 


e = 6 := cjdd/w] 


(D,5) 


(5 < 1 since c/w < 1. Each kernel therefore individually satishes the condition of Theorem 
m and hence N = [loge/(l — 5)] + 1 ensures that the collection of all kernels P* satisfy 
Condition 1 of Theorem [5l ■ 

Although the kernel in Algorithm 2b is more general than that in Algorithm 2 , in one 
particular sense Algorithm 2 is not a special case of Algorithm 2b, since the former potentially 
updates the kernel only after each expensive evaluation rather than after each iteration. 
Adaptation times enter the proof of Theorem [B] through the diminishing adaptation condition 
(see Theorem [5] of this article), for which it suffices that the probability of a change in the 
kernel at any given iteration, n, tends to zero as n —?■ cxd. For Algorithm 2 this is guaranteed 
through Condition (5); however it also holds for Algorithm 1 since lim^^oo^n = C )0 almost 
surely, as we now demonstrate. Subject to Assumption 2 , the acceptance probability for the 
hxed kernel between [9,w] and [9*,w*] is 

/ w* \ w* 

c^PM {[di w], [9*, u;*]) > umh ( 1 a — I > —aMnid, 9*). (D.6) 

\ w J w 
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By (7), the average acceptance rate from [9^w\ for the fixed kernel is 


(^PM {[9-,w]) > [oiMH{d,0*)] — - _ ^ ^ . 

w w 


he id ealised Metropolis-Bastings kernel P{0, ■) is unif ormly ergodic by Assumption 1 flRoberts and Rosentl 


(1200411 1 so aMniO) is bounded below by some ao > 0 (IRoberts and Tweedid (119961) Proposi 


tion 5.1). Hence the overall acceptance rate is bounded below by (3aQc/w, and by the strong 
law of large numbers, lim„_j.oofn = oo. 


D.5.2 Proof of Theorem [2] 


he p roof of ergodicity of adaptive MCMC algorithms in Theorem 5 of 


Roberts and Rosenthal 


(120071) relies on a hypothetical Markov chain (for us, x' := {[6^', which is identical 


to the real chain up until some iteration no and then continues in parallel with the real chain 
using the kernel at no without any further adaptation. The kernels for this chain are 
where 

7n n<nQ 
7no ^ ^ ^0- 

After iteration no, the hypothetical chain clearly has tt as its stationary distribution. Theo¬ 
rem m then informs us that after a further ni iterations 


7n = 




7n0 ’ 



\ni 


(D,7) 


for any x, including the [6, w\ value of the chain after no iterations; the hypothetical chain is 
close to the target. Diminishing adaptation is then used to show that if no is large enough 
then after these further rii iteratio ns the real chain is close to the hypothetical chain. 


We use the same approach as in iRoberts and Rosenthall (120071) and show that for a given 


required TVD between the true chain and tt, the required run length can be specified in 
terms of the expected number of expensive iterations. Since Stage Two acceptance proba¬ 
bilities are irrelevant to our argument and henceforth denote, respectively, the Stage 
One acceptance probability and its average after n iterations. Throughout this proof, for 
simplicity of notation, we will suppress the superscript ^ from the terms Uq, 

nf, J^, J^. The point is that the rate of the convergence in (ID.lOj) and the bound in (ID.131) 
only depend on ^ through the expected number of expensive evaluations, 
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Let In and Jn be, respectively, the number of expensive evaluations of the true algorithm 
between iterations 1 and n and between iterations no + 1 and no + n. First we dehne the 
following events: 


An := {an < 2a} and Bno,^ ■= {1 - (1 - < e}. (D.8) 


The probability that an iteration involves an evaluation of the expensive posterior is 

Pn := + (1 - (I)an = a + (1 - . 


Thus 


E [In] = pj = na f K + (1 — Ka)^ j . (D.9) 

j=i ^ / 

So E [In] > nan and E [/„] — )■ cx). Now, Var[/„ — In-i] = Pn(l — Pn) < pn and whether or not 
each iteration is expensive is a sequence of independent Bernoulli trials, so Var[/„] < E [/„]. 
Combined with Chebyshev’s inequality shows that as E [J„] —)■ oo, /„ —)■ oo in probability 
and hence that the diminishing adaptation probabilities satisfy 


E |p;„l -> 0. 


(D.IO) 


By assumption, for E [J„] > icmdei P (-^n) > 1 — e. Further, from flD.9p . conditional on 
Ant E [Jn\ < na {k + 2). Thus 


. - E [Jn] 

Conditional on An, na > -, 

k + 2 


(D.ll) 


The TVD between the true adaptive chain and the hypothetical chain described in the 
preliminaries for this proof is b ounded above by the probability that they are not coupled (e.g 
Roberts and Rosenthal! . l2004l) . and since adaptation can only occur after the true posterior 


(or an unbiased estimate thereof) is evaluated, the TVD between the real chain and the 
hypothetical chain after uq + rii iterations is less than 1 — (1 — Given flD.7p . 

the triangle inequality and the monotonicity of p*, the following two conditions, therefore, 
guarantee that the TVD between the true chain and vr is less than 2e: Bno,ni and 

(D.12) 

By fID.Sp and fID.Sp . 6 = acnS/w < 1 since c/w < 1. Also log(l — 5) < —5. Thus, 
rii log(l — (5) < —ni5 = 


uiacnS ^ , ckS 

<-E[Jn,] 


w 


w{k + 2 ) ’ 
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by fID.lip applied to the iterations from no + 1. Hence, conditional on flD.12D can be 
guaranteed by fixing E[J„j] (i.e. choosing ni) 


E [J„J > 


w{k + 2) loge 
ck6 


(D.13) 


To deal with for this hxed E [Jm], apply Jensen’s inequality twice and then (ID.lOp : 

E [(1 > E [(1 -p/.J'l-'-ll > E [1 -p/.j'l-'-l ^ 1 


as E [Ino] —)■ C) 0 . Hence, by Markov’s inequality, for sufficiently large E [J„g] we can ensure 
P(f3no,ni) > 1 - e. 

With these choices of E [/„(,] and E[J„J, and hence of E = E [/„g] + E[J„J, and 
Bno,ni each holds with probability 1 — e, so the TVD between the true chain and tt is less 
than 4e. 


D.5.3 Proof of Theorem [3] 

We denote the k-nearest neighbour approximation to the posterior at 9* after n iterations 
by nn{9*) and a ball of radius r centred at 6 by Br{9). We denote the Stage One and Stage 
Two acceptance probabilities at iteration n by ai^n{9,9*) and a2^n{9,9*). 

For 6^ G 0 let 


An{6,r) := {after n iterations all k nearest neighbours to 9 he within Br{9)}, 


and note that H„_i(0,r) H„(0,r). If H„(0,r) holds then then any change (adaptation) 

in the cheap estimate T^n{9) must occur through one or more new points being added to the 
tree inside Br{9). Since logvr is continuous, for any e > 0 3 eo such that if H„_i(6*,eo) and 
yl„_i(6'*,eo) hold, then 


1 1 I ^ 

f — ^ < - - 77 W < 1 + e and 

T^n-l{9) 


T^n+l{,9*) 

T^n{9*) 


< 1 T £• 


and P 7 „+i differ only in their acceptance probabilities, with using the ratio of 'jfn{9*) 
and 7tn-i{9) in both and a 2 ,n- Yet, subject to H„_i(6*,eo) and An-i{9*,eo), 


\4 Oli^n+l{9,9*)a2,n+l{9,9*) 

^ <yi,n(9,9*)a2,n(9,9*) 


<(l + 6)h 
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Since 0 is compact, logTr is uniformly continuous. Hence, for small enough e, and subject 

OH <5£. 


We show, given any e > 0, 3 snch that P {r\g*^eAn{0*, eo)) > 1 — 2e, for all n > n^. So for 
n > n„ snpg \ ■) - P^^{9, ■)\ \ < 7e. 

First, partition the (hyperrectangnlar) state space, 0, into ria hypercnbes of size e^, 
snch that any ball of radius Co must contain at least one hypercube. We now place n 
points uniformly at random in 0 (i.e. according to a homogeneous Poisson process, U). 
Denote the number of points that fall in the ith hypercube by □*. For any hxed fc G N, 
P (3 i G {1,..., nn} such that □* < /c) < rinP (Di < /c) —)■ 0 as n —)■ cx). Hence there is an 
n, such that for all n > n,, P (3 i such that Dj < k\n points from U) < e. 

The minorisation condition holds over the whole state space, and each kernel Hy is re¬ 
versible; moreover adaptation only occurs on even-numbered exp ensive iterat i ons so that 


each kernel is used (at least) twice before adaptation. Lemma 26 of 


Crain et ah 


(1201, then 


implies that for each pair of iterations there is a probability of at least /3^(5^/4 of sampling 
from 71. Since logvr is continuous and 0 is compact. 


mingge 7r{9) 

P •=- 7q\ > ^ 

max06e7r(6') 

So whenever a sample from vr is obtained, a sample from the homogeneous Poisson process, 
U, may be obtained with a probability of at least p. Dehne the event 


Cn '■= {3 at least one sample from U in n iterations}. 

Then P (C^) < (1 — and, given e > 0 and n, there is an Uonce such that for all 

n > Uonce, P (Q) < Hence for all n > n, := UonceU,, 

P (after n iterations 3 i G {1,..., nn} such that Di < k) < 2e. 


If there are at least k entries in each hypercube then, since all balls of radius r contain at 
least one hypercube, the k nearest neighbours to all 6** G 0 must be within Br{9*). 
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Label 

Reaction 

Hazard 

Description 

Ri 

^ 2Ai 

yiXi 

Prey reproduction 

R2 

Ai + A 2 ^ 2 V 2 

122X1X2 

Prey death, predator reproduction 

Rs 

X2^% 

^3X2 

Predator death 

Table E.l: Reaction list and hazards for the Lotka-Volterra system. 

Label 

Reaction 

Hazard Description 


Ri 

DNA + P2 ^ DNA - P2 

Z/1X1X4 

Dimer binding 

R2 

DNA - P2 DNA+ P2 

Mk-Xi) 

Dimer unbinding 

R3 

DNA ^ DNA + RNA 

123X1 

Transcription 

/?4 

RNA ^ RNA+ P 

124X2 

Translation 

R5 

2 P ^ P2 

12,X3{X3 - l )/2 

Forward dimerisation 

R& 

P2 ^ 2 P 

12qX4 

Reverse dimerisation 

Rj 

RNA ^0 

Z/7V2 

RNA degradation 

Rs 

P ^0 

^3X3 

Protein degradation 


Table E.2: Reaction list and hazards for the anto-regulatory system. 


E Simulation study: model details and further results 


E.l Model and inference details and data simulation 


Tables IE. II and IE.21 list the reactions and associated hazards for the Lotka Volterra and 
autoregulatory examples, respectively. 

A single data set was simulated from the Lotka-Volt e rra M JP using an initial value of 
Xq = (71, 79) and parameter values taken from lWilkinsonI (120121) . that is n = (1.0, 0.005, 0.6). 
Each Xt, (t = 1,..., 50) was corrupted as in equation (11) with ai = a 2 = 8. 

For the autoregulatory system we generated 2 synthetic datasets (labened_as^]_and_P2l. 


by ta. 


ing Xq = (5, 8,8,8), K = 10 and using parameter values taken from lGolightlv and Wilkinson 


( 201ll ). that is z/ = (0.1, 0.7, 0.35, 0.2, 0.1, 0.9, 0.3, 0.1). Simulated values were corrupted with 


noise as in (11) with cxi = (72 = 0.5 and = 1. Dataset "Di consists of 101 observations 

on the time interval [0,100] and V 2 consists of 201 observations on the time interval [0,1000]. 
For the autoregulatory system, the total number K of DNA- P 2 and DNA is hxed through- 
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out the evolution of the system, for in our inferences it is assumed to be known, so that the 
model comprises of 4 species. We denote the nun iber of molecules of DNA , RNA, P and P 2 
as Xi, X 2 , X 3 and X 4 respectively. As noted by Golightlv and Wilkinson (j^On), the rate 
constants in the reversible reactions can be difficult to infer and we therefore £x ui and 1^5 at 
the ground truth. We consider inference for the remaining rate constants and the observation 
error standard deviations, 10 parameters in all. 


E.2 Tuning parameters for the fixed kernels 


Sherlock et ah 


For infe rence on the Lotka Volterra system, we follow the practical advice of 
2015bl) by choosing the number of particles, m, so that the variance (r^) in the log-posterior 


at the median (estimated from the pilot run) and 4 additional sampled parameter values is 
less than 3. We set m = 200 since this gave G [2.06,3.07]. Under certain assumptions 


regarding target a nd the variance i n the 


Sherlock et al. 


og-posterior, in the case that the target is approx- 


fj2015bl ) found that the scaling of the proposal variance 


imately Gaussian, 

should be approximately A = (2.56^/(i) to optimise efficiency. We found further scaling this 
quantity by 1.1 appeared to give optimal performance in terms of effective sample size (ESS) 
per second giving Vfixed = 1.1 x (2.56^/5) x S. 


For the autoregulatory example, following [Roberts and Rosenthall (120011 ). we found that 


scaling variance estimated from the tuning run by A = (2.38^/(i) with d = 10 gave optimal 
performance of the fixed kernel (ie without delayed acceptance). 

The choices of e = 0.3065 for the Lotka Volterra and e = 0.982 for the autoregulatory 
system followed from considerations described immediately after ( 12 ). 


F The linear noise approximation 

The linear no ise approximation (LN A) of the Markov jump process dehned by the reaction 


hazards (e.g. iFearnhead et al.l. l2014l) ignores discreteness (but not stochasticity) and gives 
the state Xt as a Gaussian: Xt ^ N {zt + mt, Vt), where Zt, rrit and V) satisfy a coupled 
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ODE system 

( 

Zt= S h{zt, v) 

< mt= Ftmt (F.l) 

Vt = VtF]; + ^diag {h{zt, u)}S^ + F,Vt 

\ 

Here, h{xt, v) is the length-8 column vector containing the reaction hazards, is a 4 x 4 
matrix whose entry is given by the hrst partial derivative of the ith component of 

S h{zt, v) with respect to the jth component of Zt and S is the 4x8 stoichiometry matrix 
whose (i, j)th element gives the effect of reaction j on species i. 

F.l Marginal likelihood under the linear noise approximation 

For simplicity of exposition we assume an observation regime of the form 

Y^ = X, + et, ei~N(0,S) 

where e* is a length-d^. Gaussian random vector and t = 0,1,...,n. Suppose that Xi is 
hxed at some value xi. The marginal likelihood 7 r{yi-.n\(^) (and hence the posterior up to 
proportionality) under the LNA can be obtained as follows. 

1. Initialisation. Compute 

= 0(|/1; Xi, S) 

where 0 {yi ; , S) denotes the Gaussian density with mean vector xi and variance 

matrix S. Set ai = xi and C to be the dx x dx matrix of zeros. 

2. For times f = 1, 2,..., n — 1, 

(a) Prior at f -I- 1. Initialise the LNA with Zt = at, rrit = 0 and V) = Cf. Note that 
rus = 0 for all s > t. Integrate the ODE system flF.lD forward to f -|- 1 to obtain 
Zt+i and 10+1. Hence 

^t+i\yi:t, d ~ N{zt+i, 10+l) . 

(b) One step forecast. Using the observation equation, we have that 

Yt+i\yi-.t,d ^ N {zt+i,Vt+i + T.) . 
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Compute 


7r{yi:t+i\9) = 7i{yi-.t\9) 0 {yt+i ; zt+i , Vt+i + S) . 

(c) Posterior at t+1. Combining the distributions in (a) and (b) gives Xt^i\yi,t+i, 9 ~ 
N{at+i,Ct+i) where 

ot+i = zt+i + Vt+i (Vt+i + S) {yt+i — zt+i) 

Ct+i = Vt+I - Vt+1 {Vt+i + S)-' . 

G Additional graphics and discussion from the simu¬ 
lation study 



Figure G.l: Marginal posterior densities of (i = 1,... ,5) based on the (thinned) output 
of da-PsMMH. 


With regard to Table IG.31 we impose the restriction that k < b, since choosing k > b 
automatically implies that the k nearest neighbours to a particular parameter value will, at 
some point, be split over more than one branch node. We found that using k = 2 reduces 
the computational cost of searching the tree but also reduces the accuracy of the KD-tree 
approximation, resulting in an overall decrease in mESS. Similarly, for k > 5 the increased 
accuracy is offset by increased computational cost. Using k = 5 and 2b = 20 gave a 7-fold 
improvement in overall efficiency over PsMMH. 
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Figure G.2: Log-posterior estimates under the MJP (log(7rs(6*))) against the corresponding 
log-posterior estimate given by the KD-tree (log(d'c(6'))) based on the training data only (left 
panel) and the hnal adapted tree obtained after running da-PsMMH for 10^ seconds with 
^ = 3 and fc = 6 = 10 (right panel). Both plots are obtained using 5, 000 values of 9 sampled 
from the posterior 7r(6*). 



2 b 

k 

4 10 20 30 

2 

2807 (5.3) 2715 (5.1) 3435 (6.5) 3423 (6.5) 

5 

3017 (5.7) 3871 (7.3) 3848 (7.3) 

10 

3591 (6.8) 3377 (6.4) 

15 

3330 (6.3) 


Table G.3: Minimum effective sample size (niESS) and relative mESS in parentheses for the 
Lotka Volterra model. 
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Figure G.3: Marginal posterior densities oi 6i (i = 1,..., 10) based on the (thinned) output 
of da-MMH using dataset Vi (solid) and V 2 (dashed). 
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